Multilevel Measures of Document Similarity
Authors
Abstract
Many applications, such as document summarization, passage retrieval and question answering, require a detailed analysis of semantic relations between terms within and across documents and sentences. Often one has a number of sentences or paragraphs and has to choose the candidate with the highest relevance to the topic or question. An additional requirement may be that the information content of the next candidate differs from that of the sentences already chosen.

Many approaches to information retrieval and document classification model the semantic similarity between documents using the relations between semantic classes of words. They include representing the dimensions of the document vectors with distributional term clusters (?) and expanding the document and query vectors with synonyms and related terms, as discussed in (?). Latent Semantic Analysis (LSA) (?) is one of the best-known dimensionality reduction algorithms. It represents documents as vectors in the space of latent semantic concepts. Latent Dirichlet Allocation (LDA) (?) uses the latent semantic concepts as bottleneck variables in computing the term distributions for documents. The new representation captures overall semantic similarity between documents but is less sensitive to differences at the sentence level. Moreover, these methods include all vocabulary terms in their computations, which limits their applicability.

Semantic similarity at the word level is the target of work on word sense disambiguation (WSD), e.g. Schütze (?), and on verb classification, e.g. D. Lin (?). This research has shown that different measures of similarity may be required for different groups of terms, such as nouns and verbs. It is also reasonable to use different notions of similarity for content-bearing general vocabulary words and for named entities. WSD methods usually rely on co-occurrence statistics, whereas verb similarity measures are based on syntactic similarity.
In this project, we propose to use a combination of similarity measures between terms to model document similarity. We divide the vocabulary into general vocabulary terms and named entities and compute a separate similarity score for each group of terms. The overall similarity score is a function of these two scores. In addition, we use statistical co-occurrence as well as syntactic similarity to compute the similarity between the general vocabulary terms.
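The two-score scheme described above can be illustrated with a minimal sketch. The abstract does not specify the individual measures or the combining function, so everything concrete here is an assumption made for illustration: cosine similarity over term counts stands in for the general-vocabulary score (in place of the co-occurrence and syntactic measures the authors mention), Jaccard overlap stands in for the named-entity score, and the overall score is a weighted sum. The names `split_terms`, `combined_similarity` and `weight` are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap between two sets of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_terms(tokens, named_entities):
    """Divide tokens into general-vocabulary counts and a named-entity set."""
    general = Counter(t for t in tokens if t not in named_entities)
    entities = {t for t in tokens if t in named_entities}
    return general, entities


def combined_similarity(tokens_a, tokens_b, named_entities, weight=0.5):
    """Weighted sum of the two per-group scores (an assumed combining function)."""
    gen_a, ent_a = split_terms(tokens_a, named_entities)
    gen_b, ent_b = split_terms(tokens_b, named_entities)
    general_score = cosine(gen_a, gen_b)   # stand-in for the general-vocabulary measure
    entity_score = jaccard(ent_a, ent_b)   # stand-in for the named-entity measure
    return weight * general_score + (1 - weight) * entity_score


if __name__ == "__main__":
    entities = {"paris", "berlin"}
    print(combined_similarity(["paris", "summit"], ["berlin", "summit"], entities))
```

In this toy example the two sentences share their general vocabulary but mention different named entities, so the two group scores disagree; the weighted sum makes that disagreement visible, which a single bag-of-words score would hide.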
Similar resources
Document Representation and Multilevel Measures of Document Similarity
We present our work on combining large-scale statistical approaches with local linguistic analysis and graph-based machine learning techniques to compute a combined measure of semantic similarity between terms and documents for application in information extraction, question answering, and summarisation.
Examining the Applicability of Centrality Measures as Indicators of Citation Relationships Between Documents in Relational Information Retrieval: A Preliminary Study
Purpose: This pilot study investigates the correlation between centrality measures and bibliographic coupling, a well-known citation-based document similarity measure. Methodology: Using citation analysis, 40 research articles belonging to four engineering/pure disciplines (Physics, Chemistry, Biology, and Computer) and four Humanities and Social disciplines (Economics, Edu...
HESITANT FUZZY INFORMATION MEASURES DERIVED FROM T-NORMS AND S-NORMS
In this contribution, we first introduce the concept of metrical T-norm-based similarity measure for hesitant fuzzy sets (HFSs) by using the concept of T-norm-based distance measure. Then, the relationship of the proposed metrical T-norm-based similarity measures with the other kind of information measure, called the metrical T-norm-based entropy measure, is discussed. The main feature ...
SOME SIMILARITY MEASURES FOR PICTURE FUZZY SETS AND THEIR APPLICATIONS
In this work, we present some novel processes to measure the similarity between picture fuzzy sets. Firstly, we adopt the concepts of intuitionistic fuzzy sets, interval-valued intuitionistic fuzzy sets and picture fuzzy sets. Secondly, we develop some similarity measures between picture fuzzy sets, such as the cosine similarity measure, weighted cosine similarity measure, set-theoretic similar...
Combining Multilevel and Multifeature Representation to Compute Melodic Similarity
In the proposed approach, melodic similarity is computed as a content-based information retrieval task. To this end, the initial incipit is considered as the query in a query-by-example paradigm and the ranked list of potentially similar documents is given by the list of retrieved documents. The approach to retrieval is based on document indexing, where each document is described by alternative ...